1. Problem Definition

The Sales and Marketing team of Kira Plastinina would like to understand their customer’s behavior from data that they have collected over the past year. More specifically, they would like to learn the characteristics of customer groups.

Metrics of success

The analysis will be successful if I can identify trends among groups of users and build a model that uses those characteristics to predict whether a user will bring revenue to the company.

Context

Kira Plastinina is a Russian brand that was sold through a now-defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines, and Armenia.

2. Data Sourcing

The data was collected by the Sales and Marketing team over the past year and is loaded below from the provided dataset link.

3. Checking the data

library('data.table')
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
customers <- fread('http://bit.ly/EcommerceCustomersDataset')
head(customers)
tail(customers)
str(customers)
## Classes 'data.table' and 'data.frame':   12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 1 2 3 ...
##  $ ProductRelated_Duration: num  0 64 -1 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, ".internal.selfref")=<externalptr>

4. Cleaning the data

Checking for null values

colSums(is.na(customers))
##          Administrative Administrative_Duration           Informational 
##                      14                      14                      14 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                      14                      14                      14 
##             BounceRates               ExitRates              PageValues 
##                      14                      14                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0

14 rows with null values across multiple columns were found.

customers[is.na(customers$Administrative)==TRUE]

The null values occur in the same rows, so those rows will be dropped.

dim(customers)
## [1] 12330    18
dim(na.omit(customers))
## [1] 12316    18
customers <- na.omit(customers)

Checking for duplicates

dim(customers[duplicated(customers)])
## [1] 117  18

117 duplicated rows were found; they will be removed.

customers <- customers[!duplicated(customers)]

Getting a summary of the dataset

summary(customers)
##  Administrative  Administrative_Duration Informational    
##  Min.   : 0.00   Min.   :  -1.00         Min.   : 0.0000  
##  1st Qu.: 0.00   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.00   Median :   9.00         Median : 0.0000  
##  Mean   : 2.34   Mean   :  81.68         Mean   : 0.5088  
##  3rd Qu.: 4.00   3rd Qu.:  94.75         3rd Qu.: 0.0000  
##  Max.   :27.00   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :  -1.00        Min.   :  0.00   Min.   :   -1.0        
##  1st Qu.:   0.00        1st Qu.:  8.00   1st Qu.:  193.6        
##  Median :   0.00        Median : 18.00   Median :  609.5        
##  Mean   :  34.84        Mean   : 32.06   Mean   : 1207.5        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1477.6        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates        ExitRates         PageValues        SpecialDay     
##  Min.   :0.00000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.01422   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.00293   Median :0.02500   Median :  0.000   Median :0.00000  
##  Mean   :0.02045   Mean   :0.04150   Mean   :  5.952   Mean   :0.06197  
##  3rd Qu.:0.01667   3rd Qu.:0.04848   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.20000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##     Month           OperatingSystems    Browser           Region     
##  Length:12199       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.358   Mean   :3.153  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##   TrafficType     VisitorType         Weekend         Revenue       
##  Min.   : 1.000   Length:12199       Mode :logical   Mode :logical  
##  1st Qu.: 2.000   Class :character   FALSE:9343      FALSE:10291    
##  Median : 2.000   Mode  :character   TRUE :2856      TRUE :1908     
##  Mean   : 4.075                                                     
##  3rd Qu.: 4.000                                                     
##  Max.   :20.000

Assigning columns to variables for ease of use

admin                   <-customers$Administrative
admin_duration          <-customers$Administrative_Duration
info                    <-customers$Informational
info_duration           <-customers$Informational_Duration
product                 <-customers$ProductRelated
product_related_duration<-customers$ProductRelated_Duration
bounce_rates            <-customers$BounceRates
exit_rates              <-customers$ExitRates
page_values             <-customers$PageValues
special_day             <-customers$SpecialDay
month                   <-customers$Month
os                      <-customers$OperatingSystems
browser                 <-customers$Browser
region                  <-customers$Region
traffic_type            <-customers$TrafficType
visitor_type            <-customers$VisitorType
weekend                 <-customers$Weekend
revenue                 <-customers$Revenue

Checking for Outliers in the dataset

out <- boxplot(admin)$out

print(paste(length(out),'outliers'))
## [1] "404 outliers"
out <- boxplot(admin_duration)$out

print(paste(length(out),'outliers'))
## [1] "1147 outliers"
out <- boxplot(info)$out

print(paste(length(out),'outliers'))
## [1] "2630 outliers"
out <- boxplot(info_duration)$out

print(paste(length(out),'outliers'))
## [1] "2437 outliers"
out <- boxplot(product)$out

print(paste(length(out),'outliers'))
## [1] "1007 outliers"
out <- boxplot(product_related_duration)$out

print(paste(length(out),'outliers'))
## [1] "951 outliers"
out <- boxplot(bounce_rates)$out

print(paste(length(out),'outliers'))
## [1] "1431 outliers"
out <- boxplot(exit_rates)$out

print(paste(length(out),'outliers'))
## [1] "1327 outliers"
out <- boxplot(page_values)$out

print(paste(length(out),'outliers'))
## [1] "2730 outliers"
out <- boxplot(special_day)$out

print(paste(length(out),'outliers'))
## [1] "1249 outliers"
out <- boxplot(os)$out

print(paste(length(out),'outliers'))
## [1] "107 outliers"
out <- boxplot(browser)$out

print(paste(length(out),'outliers'))
## [1] "4321 outliers"
out <- boxplot(region)$out

print(paste(length(out),'outliers'))
## [1] "505 outliers"
out <- boxplot(traffic_type)$out

print(paste(length(out),'outliers'))
## [1] "2083 outliers"

While there were many outliers in the dataset, removing them would distort the data: the users are unique visitors from different countries with different norms, and eliminating these rows would bias the analysis against users outside the 'normal' range.
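The fourteen near-identical boxplot calls above can be condensed into one helper. A sketch using boxplot.stats(), which applies the same 1.5 × IQR rule without drawing a plot (count_outliers is my name, demonstrated on a synthetic vector):

```r
# Count outliers in a numeric vector using the same 1.5 * IQR rule that
# boxplot() uses, via boxplot.stats() so no plot is drawn.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Demo on a synthetic column: 1000 sits far outside the fences.
demo <- c(rep(1:10, 5), 1000)
n_out <- count_outliers(demo)   # 1

# Applied to the dataset, something like
#   sapply(list(admin = admin, info = info, region = region), count_outliers)
# would reproduce the counts above in one named vector.
```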

EDA

get.mode <- function(v){
  # unique() gets all the distinct values in the column;
  # match(v, uniq) maps each value to its index in uniq;
  # tabulate() counts how often each index occurs;
  # which.max() gets the index of the first maximum in the tabulated counts,
  # which is then used to look up the modal value in uniq
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# convenience function for univariate analysis
univar <- function(column, dataset, plot = 'hist'){
  subject <- dataset[[column]] # the subject of the analysis
  if (plot == 'category'){
    # for categorical data, plot a bar per category
    ggplot(dataset, aes(subject)) + geom_bar(fill = '#222222')
  } else {
    # for numeric data, print summary statistics and plot a
    # density plot or a histogram (the default)
    if (plot == 'density'){
      plt <- geom_density()
    } else {
      plt <- geom_histogram(fill = "#222222", colour = "#038b8d")
    }

    print(paste('mean : ', mean(subject)))
    print(paste('mode : ', get.mode(subject)))
    print(paste('median : ', median(subject)))
    print(paste('max : ', max(subject), ' min :', min(subject)))
    print(paste('quantile 5%: ', quantile(subject, probs = 0.05),
                'quantile 95%: ', quantile(subject, probs = 0.95)))
    print(paste('standard deviation :', sd(subject)))

    ggplot(dataset, aes(subject)) + plt
  }
}
univar('Revenue',customers,'category')

Most users did not bring revenue

univar('Administrative',customers,'category')

Most users did not visit administrative pages

univar('Administrative_Duration',customers)
## [1] "mean :  81.6821434502838"
## [1] "mode :  0"
## [1] "median :  9"
## [1] "max :  3398.75  min : -1"
## [1] "quantile 5%:  0 quantile 95%:  352.23428568"
## [1] "standard deviation : 177.528167792632"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The time spent on administrative pages ranged from -1 to 3398.75 minutes, with most users spending between 0 and 352 minutes. The average was 81.68 minutes, but since most users did not visit administrative pages at all, a lot of users spent 0 minutes on them. The graph is also skewed to the right.

univar('Informational',customers,'category')

Most users did not visit informational pages

univar('Informational_Duration',customers)
## [1] "mean :  34.8373363758361"
## [1] "mode :  0"
## [1] "median :  0"
## [1] "max :  2549.375  min : -1"
## [1] "quantile 5%:  0 quantile 95%:  199"
## [1] "standard deviation : 141.458498734393"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The time spent on informational pages ranged from -1 to 2549 minutes, with most users spending between 0 and 199 minutes. The average was 34.84 minutes, but since most users did not visit informational pages at all, a lot of users spent 0 minutes on them. The graph is also skewed to the right, meaning most users fall in the lower end of the range.

univar('ProductRelated',customers)
## [1] "mean :  32.0584474137224"
## [1] "mode :  1"
## [1] "median :  18"
## [1] "max :  705  min : 0"
## [1] "quantile 5%:  2 quantile 95%:  110"
## [1] "standard deviation : 44.6009113793394"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The average number of product pages visited was 32, with the most common count being 1 page. The number of pages visited ranged from 0 to 705, and most users visited between 2 and 110 pages. The graph is also skewed to the right, meaning most users fall in the lower end of the range.

univar('ProductRelated_Duration',customers)
## [1] "mean :  1207.50818855772"
## [1] "mode :  0"
## [1] "median :  609.5416667"
## [1] "max :  63973.52223  min : -1"
## [1] "quantile 5%:  0 quantile 95%:  4313.4514098"
## [1] "standard deviation : 1919.92747174039"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The time spent on product-related pages ranged from -1 to 63973.52 minutes, with most users spending between 0 and 4313.45 minutes. The average was 1207.51 minutes, but since the most common visit count was a single product page, a lot of users spent 0 minutes on them. The graph is also skewed to the right.

univar('BounceRates',customers)
## [1] "mean :  0.0204467350769735"
## [1] "mode :  0"
## [1] "median :  0.002930403"
## [1] "max :  0.2  min : 0"
## [1] "quantile 5%:  0 quantile 95%:  0.15"
## [1] "standard deviation : 0.0454025027384669"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Bounce rates ranged from 0 to 0.2, and most users lie between 0 and 0.15. The average bounce rate was 0.02, and the most common bounce rate was 0. The graph is skewed to the right.

univar('ExitRates',customers)
## [1] "mean :  0.0414967835993114"
## [1] "mode :  0.2"
## [1] "median :  0.025"
## [1] "max :  0.2  min : 0"
## [1] "quantile 5%:  0.004545455 quantile 95%:  0.175"
## [1] "standard deviation : 0.046247160452655"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exit rates ranged from 0 to 0.2, and most users lie between 0.0045 and 0.175. The average exit rate was 0.0415, and the most common exit rate was 0.2. The graph is mostly skewed to the right.

univar('PageValues',customers)
## [1] "mean :  5.95250015960423"
## [1] "mode :  0"
## [1] "median :  0"
## [1] "max :  361.7637419  min : 0"
## [1] "quantile 5%:  0 quantile 95%:  38.311637762"
## [1] "standard deviation : 18.6577915027872"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A lot of users had a page value of 0, while the average page value was 5.95; page values ranged from 0 to 361. The graph is also skewed to the right, meaning most users fall in the lower end of the range.

univar('SpecialDay',customers,'category')

Most people did not visit the site close to special days

univar('OperatingSystems',customers,'category')

Most users visiting the site used the operating system encoded as 2. Operating systems 1, 2, and 3 were the most commonly used.

univar('Browser',customers,'category')

Browser 2 was the most commonly used, though browser 1 was also widely used.

univar('Region',customers,'category')

Most users were from region 1, followed by region 3; region 5 had the fewest users.

univar('TrafficType',customers,'category')

Traffic types 2, 1, and 3 were the most common, in that order, with few users on traffic types 16, 17, and 18.

ggplot(customers,aes(Administrative,fill=Revenue))+ geom_bar()

Few users with a low count of administrative pages generated revenue; most users, with a count of 0, generated none.

ggplot(customers,aes(admin_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Few users with a low duration on administrative pages generated revenue; most users, with a duration of 0, generated none.

ggplot(customers,aes(info,fill=revenue))+ geom_bar()

Few users with a low count of informational pages generated revenue; most users, with a count of 0, generated none.

ggplot(customers,aes(info_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Few users with a low duration on informational pages generated revenue; most users, with a duration of 0, generated none.

ggplot(customers,aes(product,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Few users with a low count of product pages generated revenue; most users generated none.

ggplot(customers,aes(product_related_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Few users with a low duration on product pages generated revenue; most users generated none.

ggplot(customers,aes(bounce_rates,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Users with higher bounce rates tended not to generate revenue, while a large number of users with low bounce rates did.

ggplot(customers,aes(exit_rates,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Users with higher exit rates tended not to generate revenue, while a large number of users with low exit rates did.

ggplot(customers,aes(page_values,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most users with a low page value generated no revenue, but as page values increased, users were more likely to generate revenue.

ggplot(customers,aes(special_day,fill=revenue))+ geom_bar()

Fewer people visited the site as special days approached. Most revenue came on days not close to special days.

ggplot(customers,aes(os,fill=revenue))+ geom_bar()

ggplot(customers,aes(browser,fill=revenue))+ geom_bar()

Since most users used browser 2, it accounted for more revenue-generating users than the other browsers.

ggplot(customers,aes(region,fill=revenue))+ geom_bar()

Region 3 had the second-most users but a better ratio of revenue to non-revenue users than region 1.

ggplot(customers,aes(traffic_type,fill=revenue))+ geom_bar()

Users on traffic types 1 to 4 included the most revenue-generating customers, with traffic type 4 having the best ratio of revenue to non-revenue users.


ggplot(customers,aes(admin_duration,info_duration))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Users with low administrative durations tend to spend more time on informational pages; as administrative duration increases, users tend to spend less time on informational pages.

ggplot(customers,aes(admin_duration,info_duration,color=revenue))+ geom_point(alpha=0.5)

Users who generated revenue had lower durations on administrative pages

ggplot(customers,aes(admin_duration,product_related_duration))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

As administrative duration increases users tend to spend more time on product pages.

ggplot(customers,aes(admin_duration,product_related_duration,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Users generating revenue had a good blend of time on both administrative and product pages

ggplot(customers,aes(admin_duration,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Most users had really low bounce rates and durations in administrative pages

ggplot(customers,aes(admin_duration,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Users who generated revenue saw bounce rates increase as their administrative duration increased, while users generating no revenue saw bounce rates decrease as their administrative duration increased

ggplot(customers,aes(admin_duration,exit_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Exit rates decreased as user duration in administrative pages increased

ggplot(customers,aes(admin_duration,exit_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Exit rates for users generating no revenue decreased as their administrative duration increased, while exit rates for those generating revenue remained fairly constant

ggplot(customers,aes(admin_duration,page_values))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Users had fairly low page values and durations in administrative pages

ggplot(customers,aes(admin_duration,page_values,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Page values of users generating revenue tended to decrease as their administrative duration increased, while those of users generating no revenue remained very low

ggplot(customers,aes(exit_rates,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

The general trend is that as bounce rates increase, so do exit rates

ggplot(customers,aes(exit_rates,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

While bounce rates increase with exit rates both for customers who generate revenue and for those who do not, those generating revenue had a slower rate of increase.

ggplot(customers,aes(product_related_duration,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Bounce rates tend to increase with an increase in product related duration.

ggplot(customers,aes(product_related_duration,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Bounce rates tend to increase with an increase in product related duration, for both users leading to revenue and not.

ggplot(customers,aes(product_related_duration,exit_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x

Exit rates tend to decrease with an increase in product-related duration.

ggplot(customers,aes(product_related_duration,exit_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x

Exit rates tend to remain fairly constant for users generating revenue, while for the others exit rates drop as product-related duration increases

label_encode <- function (column,dataframe){
  data <- dataframe[[column]]
  new_data <- as.integer(factor(data))
  # print(new_data)
  return (new_data)
}
users <- customers

users$Month <- label_encode('Month',users)
users$VisitorType <- label_encode('VisitorType',users)
users$Weekend <- label_encode('Weekend',users)
users$Revenue <- label_encode('Revenue',users)
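label_encode() relies on factor() ordering its levels alphabetically, so the integer codes are deterministic but otherwise arbitrary. A quick check of that behavior on made-up month strings:

```r
# factor() sorts levels alphabetically, so as.integer(factor(...)) yields
# a deterministic integer encoding: here "Dec" < "Feb" < "Mar".
encoded <- as.integer(factor(c("Feb", "Mar", "Feb", "Dec")))
# encoded is c(2, 3, 2, 1)
```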
get_lower_tri<-function(cor_mat){
    cor_mat[upper.tri(cor_mat)] <- NA
    return(cor_mat)
}

cor_mat <- round(cor(users),4)
lower <- get_lower_tri(cor_mat)
melted <- melt(lower, na.rm = TRUE)
## Warning in melt(lower, na.rm = TRUE): The melt generic in data.table has
## been passed a matrix and will attempt to redirect to the relevant reshape2
## method; please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(lower). In the next version, this warning will become an error.
head(melted)
ggplot(melted, aes(Var1, Var2))+  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient2(low = "#222222", high = "#1abc9c", mid = "#222222")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There was a fair correlation between Revenue and PageValues. The other variable pairs with strong correlations were ones derived from each other.
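To read the heatmap more precisely, the strongest pairs can be pulled straight out of the correlation matrix. A sketch on a toy matrix (top_pairs is my helper; the same call would apply to cor_mat above):

```r
# Extract the most strongly correlated variable pairs from a correlation
# matrix, ignoring the diagonal and the duplicate upper triangle.
top_pairs <- function(cor_mat, n = 3) {
  cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA   # keep lower triangle only
  idx <- which(!is.na(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(var1 = rownames(cor_mat)[idx[, 1]],
                      var2 = colnames(cor_mat)[idx[, 2]],
                      r    = cor_mat[idx])
  pairs[order(-abs(pairs$r)), ][seq_len(min(n, nrow(pairs))), ]
}

# Toy example: x and y are perfectly correlated, z is unrelated.
m <- cor(data.frame(x = 1:10, y = 2 * (1:10), z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)))
best <- top_pairs(m, n = 1)   # the (y, x) pair with r = 1
```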

Multivariate

library('ClusterR')
## Loading required package: gtools
library('caret')
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
get_train_validation <- function (dataset){
  # Getting the row numbers for the train sample (80% of the dataset)
  train <- sample(c(1:nrow(dataset)),size = ceiling(0.80*nrow(dataset)),replace = FALSE)
  # training set == part of the dataset in the train sample
  train_set <- dataset[train,]
  # Validation set == part of the dataset not in the train sample
  Validation_set <- dataset[-train,]
  # R functions return a single value, so bundle both sets in a list
  sets <- list("Train" = train_set, "Validation" = Validation_set)
  return (sets)
}

normalize <- function(x){
  return ((x-min(x)) / (max(x)-min(x)))
}
users.norm <- users
for (name in colnames(users)){
  users.norm[[name]] <- normalize(users[[name]])
}
head(users.norm)

I decided to use PCA to reduce the number of columns.

userd <- prcomp(users.norm[c(-18)], scale. = T)
screeplot(userd, type = "l", npcs = 15, main = "Screeplot of the first 15 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
       col=c("red"), lty=5, cex=0.6)

cumpro <- cumsum(userd$sdev^2 / sum(userd$sdev^2))
plot(cumpro[0:15], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.88759, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC6"),
       col=c("blue"), lty=5, cex=0.6)

PCA can effectively reduce the dimensionality from 17 features to 6 principal components.
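The cut-off can also be derived programmatically rather than read off the plot. A sketch that picks the smallest number of components reaching a target share of variance (n_components is my helper, demonstrated on toy data since only the threshold logic matters):

```r
# Smallest number of principal components whose cumulative share of
# explained variance reaches the target.
n_components <- function(pca, target = 0.88) {
  cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cumvar >= target)[1]
}

# Toy data: column c is a near-duplicate of a, so two components
# capture almost all the variance of the three columns.
set.seed(42)
toy <- data.frame(a = rnorm(100), b = rnorm(100))
toy$c <- toy$a + rnorm(100, sd = 0.01)
pca <- prcomp(toy, scale. = TRUE)
k <- n_components(pca, target = 0.95)   # 2
```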

p_ca <- function (dataset){
  sets <- get_train_validation(dataset)

  # drop the target (column 18, Revenue) before fitting PCA
  pca.train <- sets$Train[,-c(18)]
  pca.test  <- sets$Validation[,-c(18)]

  prin_comp <- prcomp(pca.train, scale. = T)

  # train set: the Revenue label followed by the first 6 principal components
  train.data <- data.frame(Revenue = sets$Train[[18]], prin_comp$x)
  train.data <- train.data[,1:7]

  # project the validation set onto the same components
  test.data <- predict(prin_comp, newdata = pca.test)
  test.data <- as.data.frame(test.data)
  # overwrite PC7 with the Revenue label, so after selecting the first 7
  # columns the test set is PC1..PC6 plus the label in the last column
  test.data$PC7 <- sets$Validation[[18]]
  test.data <- test.data[,c(1:7)]

  sets <- list("Train" = train.data, "Test" = test.data)
  return (sets)
}

Implementing the solution

k_means <- function (train_set, Validation_set, n_start, max_iter, algorthm){

  # getting the sets: the train set has the label in column 1,
  # the validation set has it in the last column
  x <- ncol(train_set)
  x1 <- x-1

  train_x <- train_set[,2:x]
  train_y <- train_set[[1]]

  validation_x <- Validation_set[,1:x1]
  validation_y <- Validation_set[[x]]

  model <- kmeans(train_x, 2, nstart = n_start, iter.max = max_iter, algorithm = algorthm)
  pred <- predict_KMeans(validation_x, model$centers, threads = 1)
  # map the unordered cluster ids {1,2} onto {0,1}: the larger cluster
  # is assumed to be the majority (no revenue) class and relabelled 0
  if (sum(pred == 2) > sum(pred == 1)){
    pred[pred==2] <- 0
  }else{
    pred[pred==1] <- 0
    pred[pred==2] <- 1
  }
  t <- table(validation_y, pred)
  accur <- cm_2x2_accuracy(t)

  hyper <- list('nstart :', n_start, 'iter.max :', max_iter, 'algorithm :', algorthm)
  resul <- list("model" = model, 'table' = t, "Accuracy" = accur, 'Hyper_parameters' = hyper)
  return (resul)
}
cm_2x2_accuracy <- function (t){
  # t comes from table(validation_y, pred); in column-major order the cells
  # are [1] actual 0 / pred 0, [2] actual 1 / pred 0,
  #     [3] actual 0 / pred 1, [4] actual 1 / pred 1
  correct0 <- t[[1]]
  wrong0   <- t[[2]]
  wrong1   <- t[[3]]
  correct1 <- t[[4]]
  accuracy <- (correct0 + correct1) / (correct0 + wrong0 + wrong1 + correct1)
  return (accuracy)
}
wssplot <- function(data, nc=15, seed=123){
  # within-group sum of squares for k = 1 is the total variance
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of groups",
       ylab="Sum of squares within a group")
}

wssplot(users[c(-18)], nc = 10)

The curve fluctuates, with a gradient shift starting around 2 clusters and flattening out around 6. While a 6-cluster model is worth exploring, I want to evaluate model performance against the binary Revenue label, so 2 will be my cluster size.

Given that our target is binary (either true or false), the goal is to let the model come up with 2 clusters, which coincides with my choice.
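One subtlety in k_means() above is mapping the unordered cluster ids {1, 2} onto the binary Revenue label; since most sessions generate no revenue, the larger cluster is taken to be the "no revenue" class. A minimal sketch of that majority mapping (map_clusters is my name for the logic):

```r
# Map k-means cluster ids {1, 2} to {0, 1}: the larger cluster is assumed
# to be the majority ("no revenue") class and relabelled 0.
map_clusters <- function(pred) {
  if (sum(pred == 2) > sum(pred == 1)) {
    pred[pred == 2] <- 0               # cluster 2 is the majority
  } else {
    pred[pred == 1] <- 0               # cluster 1 is the majority
    pred[pred == 2] <- 1
  }
  pred
}

mapped <- map_clusters(c(1, 1, 1, 2, 2))   # cluster 1 is the majority here
# mapped is c(0, 0, 0, 1, 1)
```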

options(warn=-1) 
accuracy <- 0
algorithm <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
start.n<- c(2 : 17)

iter.max <- c(10,15,20,25,30,35,40,45,50,55,60,65,70)
for (algo in algorithm){
  for (start in start.n){
    for (iter in iter.max){
      # Getting different train and validation sets on each run
      sets <-suppressWarnings(p_ca(users.norm))
      results <- suppressWarnings(k_means(sets$Train,sets$Test,n_start = start,max_iter = iter,algorthm = algo ))
      if (results$Accuracy > accuracy){
        best_table <- results$table
        accuracy <- results$Accuracy
        best_model <- results$model
        best_hypers <- results$Hyper_parameters
      }
    }
  }
}

print(list('Best Accuracy :',accuracy*100,'With Hyperparameters at:',best_hypers))
## [[1]]
## [1] "Best Accuracy :"
## 
## [[2]]
## [1] 80.3608
## 
## [[3]]
## [1] "With Hyperparameters at:"
## 
## [[4]]
## [[4]][[1]]
## [1] "nstart :"
## 
## [[4]][[2]]
## [1] 4
## 
## [[4]][[3]]
## [1] "iter.max :"
## 
## [[4]][[4]]
## [1] 10
## 
## [[4]][[5]]
## [1] "algorithm :"
## 
## [[4]][[6]]
## [1] "Hartigan-Wong"
confusionMatrix(best_table)
## Confusion Matrix and Statistics
## 
##             pred
## validation_y    0    1
##            0 1858  235
##            1  244  102
##                                           
##                Accuracy : 0.8036          
##                  95% CI : (0.7873, 0.8192)
##     No Information Rate : 0.8618          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1845          
##                                           
##  Mcnemar's Test P-Value : 0.7147          
##                                           
##             Sensitivity : 0.8839          
##             Specificity : 0.3027          
##          Pos Pred Value : 0.8877          
##          Neg Pred Value : 0.2948          
##              Prevalence : 0.8618          
##          Detection Rate : 0.7618          
##    Detection Prevalence : 0.8581          
##       Balanced Accuracy : 0.5933          
##                                           
##        'Positive' Class : 0               
## 
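The headline metrics can be reproduced by hand from the table above. Note that `confusionMatrix` treats the table's rows as predictions and its columns as the reference, and takes the first factor level (here `0`, no revenue) as the positive class:

```r
# Reproduce caret's metrics from the 2x2 table printed above.
tab <- matrix(c(1858, 244, 235, 102), nrow = 2,
              dimnames = list(prediction = c(0, 1), reference = c(0, 1)))

sensitivity <- tab[1, 1] / sum(tab[, 1])   # 1858 / (1858 + 244)
specificity <- tab[2, 2] / sum(tab[, 2])   # 102  / (235  + 102)
balanced    <- (sensitivity + specificity) / 2
accuracy_   <- sum(diag(tab)) / sum(tab)

round(c(sensitivity, specificity, balanced, accuracy_), 4)
# 0.8839 0.3027 0.5933 0.8036
```

The high sensitivity for class 0 together with the low specificity for class 1 is what drags the balanced accuracy down to 0.59 despite the 0.80 raw accuracy.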

The model did not perform as well as I had hoped: of the 346 revenue-generating sessions in the validation set, it misclassified 244 and correctly identified only 102, which shows up as the low specificity (0.30) and balanced accuracy (0.59).

Hierarchical Clustering

users.scald <- users

d <- dist(users.scald[,-c(18)])
h_clust <- hclust(d, method = "ward.D2")
plot(h_clust)

Selecting the number of clusters to use is a choice made by the user after creating the tree. There are two main clusters in the dendrogram, with the left branch holding the more dominant share of observations. For 3 clusters, the left main branch and the two sub-branches of the right main branch would form the clusters; for 4 clusters, the left main branch and the three sub-branches of the right main branch would form the clusters.
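The 3- and 4-cluster cuts described above can be inspected with `cutree` and highlighted on the dendrogram with `rect.hclust`; a minimal sketch on synthetic data (the report's tree is `h_clust` above):

```r
# Sketch: cutting a tree at k = 3 and k = 4 and marking the cuts.
set.seed(1)
toy  <- matrix(rnorm(60), ncol = 2)
tree <- hclust(dist(toy), method = "ward.D2")

g3 <- cutree(tree, k = 3)   # cluster membership with 3 clusters
g4 <- cutree(tree, k = 4)   # cluster membership with 4 clusters
table(g3)
table(g4)

plot(tree)
rect.hclust(tree, k = 3, border = "red")  # draw boxes around the 3 clusters
```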

For comparison purposes, and for ease of evaluating the trees, I will use the same number of clusters (two) as in the k-means model.

groups <- cutree(h_clust,k=2)

h_users <- users.scald
h_users$Revenue <- groups

table(groups)
## groups
##     1     2 
## 10256  1943

There is no `predict` method for objects of class "hclust", so the only way to check the performance of the hierarchical tree is to train a supervised model on the cluster assignments and use it to predict the outputs. However, that adds another layer of uncertainty. Say the tree were perfect but the supervised model achieves only 89% accuracy, or the tree were imperfect but the model achieves 98%+ accuracy: either way, the final result is weighed down by more than one layer of uncertainty.

library('class')
# Accuracy of a 2x2 confusion table as the fraction of correct predictions
cm_2x2_accuracy <- function(x){sum(diag(x)) / sum(x)}
k_nn <- function (train_,test_,K=21){
  # Split features from labels: train_ carries the label in column 1,
  # test_ carries it in the last column
  x <- ncol(train_)
  x1 <- x - 1

  train_x <- train_[, 2:x]
  train_y <- train_[[1]]

  validation_x <- test_[, 1:x1]
  validation_y <- test_[, x]

  # Number of positive (revenue) cases in the validation set
  val_len <- length(validation_y[validation_y == 1])
  pred <- knn(train = train_x, test = validation_x,
                      cl = train_y, k = K)

  t <- table(validation_y,pred)

  accur <- cm_2x2_accuracy(t)

  if ((length(pred[pred==1]))<(val_len/2)){
    accur <- 0
  }# Eliminating a couple of imbalanced models
  hyper <- list('k :', K)
  resul <- list('table'= t,"Accuracy" = accur, 'Hyper_parameters' = hyper)
  return (resul)
}
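`k_nn` wraps `class::knn`, which labels each validation row by majority vote among its `K` nearest training rows; a minimal standalone call on synthetic data looks like this:

```r
library(class)

# Sketch: class::knn on two synthetic Gaussian blobs.
set.seed(7)
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 4), ncol = 2))
train_y <- factor(rep(c(0, 1), each = 20))
test_x  <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
                 matrix(rnorm(10, mean = 4), ncol = 2))

# Each test row gets the majority label of its 5 nearest training rows
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(truth = rep(c(0, 1), each = 5), pred)
```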
distances <- c('euclidean', 'maximum','manhattan','canberra','binary','minkowski')
methods <- c('ward.D2','single', 'median', 'average', 'centroid', 'ward.D',  'mcquitty')
accuracy <- 0
for (dist_method in distances){
  print(dist_method)
  for (meth in methods){
    for (i in c(5,10,15,20,25,30)){
      sets <- p_ca(users.norm)
      train <- sets$Train
      valid <- sets$Test

      # Cluster the training set and cut the tree into 2 groups
      # (dist_method avoids shadowing the stats::dist function)
      d <- dist(train[,-c(1)], method = dist_method)
      h_clust <- hclust(d, method = meth)
      groups <- cutree(h_clust, k = 2)
      # Relabel so that the larger cluster is always class 0
      if (length(groups[groups==2]) > length(groups[groups==1])){
        groups[groups==2] <- 0
      }else{
        groups[groups==1] <- 0
        groups[groups==2] <- 1
      }

      h_users <- train
      h_users$Revenue <- groups

      # Score the clustering by training k-NN on the cluster labels
      results <- k_nn(h_users, valid, i)

      if (results$Accuracy > accuracy){
            best_table <- results$table
            accuracy <- results$Accuracy
            best_hypers <- results$Hyper_parameters
            best_tree <- h_clust
            best_method <- meth
            best_distance <- dist_method
      }
    }
  }
}
## [1] "euclidean"
## [1] "maximum"
## [1] "manhattan"
## [1] "canberra"
## [1] "binary"
## [1] "minkowski"
print(list('Best Accuracy :',accuracy*100,'With Hyperparameters at:',best_hypers))
## [[1]]
## [1] "Best Accuracy :"
## 
## [[2]]
## [1] 81.42681
## 
## [[3]]
## [1] "With Hyperparameters at:"
## 
## [[4]]
## [[4]][[1]]
## [1] "k :"
## 
## [[4]][[2]]
## [1] 30
confusionMatrix(best_table)
## Confusion Matrix and Statistics
## 
##             pred
## validation_y    0    1
##            0 1919  130
##            1  323   67
##                                           
##                Accuracy : 0.8143          
##                  95% CI : (0.7983, 0.8295)
##     No Information Rate : 0.9192          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1355          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8559          
##             Specificity : 0.3401          
##          Pos Pred Value : 0.9366          
##          Neg Pred Value : 0.1718          
##              Prevalence : 0.9192          
##          Detection Rate : 0.7868          
##    Detection Prevalence : 0.8401          
##       Balanced Accuracy : 0.5980          
##                                           
##        'Positive' Class : 0               
## 

The hierarchical model classified the users reasonably well, achieving 81.4% accuracy against 80.4% for k-means.

It was the better of the two models run here.

Comparing the Models

Challenging the solution

Having gauged the performance of these models and tuned each to get the best in its category, the solution could be challenged by training other classification models and comparing them against these to see which one performs best.

Conclusion

While the hierarchical clustering was more accurate, I kept getting an imbalanced validation set, so accuracy alone is not good enough to evaluate whether the model can predict which users will bring revenue.

Follow up questions

Did we have the right data?

Yes, the data was appropriate for the question asked.

Did we have the right questions?

Yes, the question was right for the data provided.